Custom training component for distributed training #240
Description
Vertex AI offers "Custom Jobs" for training, which are a little more flexible than the baseline kfp component.
For example, this was the only way I could set the number of replicas to distribute training across multiple workers at the Vertex level.
This PR introduces the config key `distributed_training`, which lets you enable the custom training job operator kfp component when you want to distribute training across more than one replica.
For now, the distribution settings (i.e. the number of replicas) are the same for all enabled tasks; you cannot set a different number of replicas per task.
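To make the replica setting concrete, here is a rough sketch of how a single replica count could map onto Vertex AI worker pools. It is not the code in this PR. The sub-keys `enabled`, `replica_count` and `machine_type`, the default machine type, and the helper name `build_worker_pool_specs` are all assumptions made for illustration. It relies on the Vertex AI convention that one replica goes to the primary pool and the rest go to a second worker pool.

```python
from typing import Any


def build_worker_pool_specs(
    distributed_training: dict[str, Any], image_uri: str
) -> list[dict[str, Any]]:
    """Turn a hypothetical `distributed_training` config into worker pool specs.

    The config sub-keys used here are illustrative assumptions, not the
    schema introduced by this PR.
    """
    enabled = distributed_training.get("enabled", False)
    # When distribution is disabled, fall back to a single replica.
    replicas = int(distributed_training.get("replica_count", 1)) if enabled else 1
    if replicas < 1:
        raise ValueError("replica_count must be at least 1")

    def pool(count: int) -> dict[str, Any]:
        # Every pool shares the same machine type and training image.
        return {
            "machine_spec": {
                "machine_type": distributed_training.get(
                    "machine_type", "n1-standard-4"
                )
            },
            "replica_count": count,
            "container_spec": {"image_uri": image_uri},
        }

    # The primary pool always holds exactly one replica.
    specs = [pool(1)]
    if replicas > 1:
        # Remaining replicas go to the second (worker) pool.
        specs.append(pool(replicas - 1))
    return specs
```

With `{"enabled": True, "replica_count": 4}` this yields a primary pool of 1 replica and a worker pool of 3. Because the same config is applied to every enabled task, each task would get the same two pools.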
I admit, I do wish Vertex/Kubeflow allowed replicas to be set on the standard components. That would make things simpler.
Looking for feedback & thoughts! Thanks
PR Checklist